AITopics | text-to-image synthesis

fa64505ebdc94531087bc81251ce2376-Paper-Conference.pdf

Neural Information Processing Systems, May-1-2026, 06:35:29 GMT

large language model, machine learning, natural language, (19 more...)

Neural Information Processing Systems

Technology:

Information Technology > Artificial Intelligence > Vision (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.93)
(2 more...)


fa64505ebdc94531087bc81251ce2376-Supplemental-Conference.pdf

Neural Information Processing Systems, May-1-2026, 05:15:24 GMT

In this work, we investigate the task of text-to-image (T2I) synthesis under the abstract-to-intricate setting, i.e., generating intricate visual content from simple abstract text prompts. Inspired by human imagination intuition, we propose a novel scene-graph hallucination (SGH) mechanism for effective abstract-to-intricate T2I synthesis. SGH carries out scene hallucination by expanding the initial scene graph (SG) of the input prompt with more feasible specific scene structures, in which the structured semantic representation of SG ensures high controllability of the intrinsic scene imagination. To approach the T2I synthesis, we deliberately build an SG-based hallucination diffusion system. First, we implement the SGH module based on the discrete diffusion technique, which evolves the SG structure by iteratively adding new scene elements. Then, we utilize another continuous-state diffusion model as the T2I synthesizer, where the overt image-generating process is navigated by the underlying semantic scene structure induced from the SGH module. On the benchmark COCO dataset, our system outperforms the existing best-performing T2I model by a significant margin, especially improving on the abstract-to-intricate T2I generation. Further in-depth analyses reveal how our methods advance.
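The idea of growing a scene graph step by step can be illustrated with a toy data structure. This is only a sketch under stated assumptions: it represents the graph as (subject, relation, object) triples and replaces the paper's discrete diffusion module with a hypothetical rule-based proposer (`toy_propose`, `CANDIDATES`, and `expand` are illustrative names, not the authors' code).

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional, Set, Tuple

Triple = Tuple[str, str, str]  # (subject, relation, object)


@dataclass
class SceneGraph:
    triples: Set[Triple] = field(default_factory=set)

    def objects(self) -> Set[str]:
        """All nodes mentioned as a subject or an object."""
        return {s for s, _, _ in self.triples} | {o for _, _, o in self.triples}


def expand(graph: SceneGraph,
           propose: Callable[[SceneGraph], Optional[Triple]],
           steps: int) -> SceneGraph:
    """Return a copy of `graph` grown by at most `steps` proposed triples."""
    grown = SceneGraph(set(graph.triples))
    for _ in range(steps):
        triple = propose(grown)
        if triple is None:  # nothing plausible left to add
            break
        grown.triples.add(triple)
    return grown


# Hypothetical candidate pool; a learned model would propose these instead.
CANDIDATES: List[Triple] = [
    ("horse", "on", "grass"),
    ("sky", "above", "grass"),
    ("man", "wearing", "hat"),
    ("cloud", "in", "sky"),
]


def toy_propose(graph: SceneGraph) -> Optional[Triple]:
    """Pick the first unused candidate that attaches to an existing node."""
    nodes = graph.objects()
    for triple in CANDIDATES:
        subject, _, obj = triple
        if triple not in graph.triples and (subject in nodes or obj in nodes):
            return triple
    return None


initial = SceneGraph({("man", "riding", "horse")})
expanded = expand(initial, toy_propose, steps=3)
```

Each step only attaches elements that connect to something already in the graph, which is one simple way to keep the expansion tied to the original prompt; the expanded graph would then condition the image generator.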


fa64505ebdc94531087bc81251ce2376-Supplemental-Conference.pdf

Neural Information Processing Systems, Feb-18-2026, 02:01:52 GMT

large language model, machine learning, natural language, (19 more...)

Neural Information Processing Systems

Technology:

Information Technology > Artificial Intelligence > Vision (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.93)
(2 more...)


Shengqiong Wu, Hao Fei

Neural Information Processing Systems, Feb-18-2026, 02:01:47 GMT

In this work, we investigate the task of text-to-image (T2I) synthesis under the abstract-to-intricate setting, i.e., generating intricate visual content from simple

large language model, machine learning, natural language, (19 more...)

Neural Information Processing Systems

Country: Asia > Singapore (0.04)

Technology:

Information Technology > Artificial Intelligence > Vision (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.93)
(2 more...)


LLMScore: Unveiling the Power of Large Language Models in Text-to-Image Synthesis Evaluation

Yujie Lu

Neural Information Processing Systems, Feb-11-2026, 08:01:20 GMT

Notably, our LLMScore achieves Kendall's

large language model, machine learning, natural language, (19 more...)

Neural Information Processing Systems

Country:

Europe > Switzerland > Zürich > Zürich (0.14)
North America > United States > Washington > King County > Seattle (0.04)
North America > United States > Michigan > Washtenaw County > Ann Arbor (0.04)
(4 more...)

Genre: Research Report (0.93)

Technology:

Information Technology > Artificial Intelligence > Vision (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.96)
Information Technology > Artificial Intelligence > Natural Language > Text Processing (0.68)


StyleDrop: Text-to-Image Synthesis of Any Style

Neural Information Processing Systems, Dec-26-2025, 21:03:28 GMT

Pre-trained large text-to-image models synthesize impressive images with an appropriate use of text prompts. However, ambiguities inherent in natural language and out-of-distribution effects make it hard to synthesize arbitrary image styles that leverage a specific design pattern, texture, or material. In this paper, we introduce StyleDrop, a method that enables the synthesis of images that faithfully follow a specific style using a text-to-image model. StyleDrop is extremely versatile and captures nuances and details of a user-provided style, such as color schemes, shading, design patterns, and local and global effects. StyleDrop works by efficiently learning a new style by fine-tuning very few trainable parameters (less than 1% of total model parameters), and improving the quality via iterative training with either human or automated feedback. Better yet, StyleDrop is able to deliver impressive results even when the user supplies only a single image specifying the desired style. An extensive study shows that, for the task of style tuning text-to-image models, StyleDrop on Muse convincingly outperforms other methods, including DreamBooth and textual inversion on Imagen or Stable Diffusion. More results are available at our project website: https://styledrop.github.io

name change, styledrop, text-to-image synthesis, (3 more...)

Neural Information Processing Systems

Technology: Information Technology > Artificial Intelligence > Vision (1.00)


Compositional Image Synthesis with Inference-Time Scaling

Ji, Minsuk, Lee, Sanghyeok, Ahn, Namhyuk

arXiv.org Artificial Intelligence, Oct-29-2025

ABSTRACT Despite their impressive realism, modern text-to-image models still struggle with compositionality, often failing to render accurate object counts, attributes, and spatial relations. To address this challenge, we present a training-free framework that combines an object-centric approach with self-refinement to improve layout faithfulness while preserving aesthetic quality. Specifically, we leverage large language models (LLMs) to synthesize explicit layouts from input prompts, and we inject these layouts into the image generation process, where an object-centric vision-language model (VLM) judge iteratively re-ranks multiple candidates to select the most prompt-aligned outcome. By unifying explicit layout-grounding with self-refine-based inference-time scaling, our framework achieves stronger scene alignment with prompts compared to recent text-to-image models.

Index Terms: text-to-image synthesis, inference-time-scaling, object-centric

1. INTRODUCTION Text-to-image (T2I) diffusion models now deliver striking realism and diversity from textual prompts [1, 2, 3, 4], yet they still struggle with compositionality: the precise rendering of object counts, attributes, and spatial relations [5].
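The re-ranking loop described in the abstract can be sketched generically as best-of-N selection repeated over several refinement rounds. This is a minimal sketch, not the authors' pipeline: `generate` and `judge` are hypothetical callables standing in for the layout-conditioned generator and the VLM judge, and the toy versions below operate on strings so the example runs without any model.

```python
from typing import Callable, Tuple


def refine_best_of_n(
    prompt: str,
    generate: Callable[[str, int, str], str],
    judge: Callable[[str, str], float],
    n_candidates: int = 4,
    n_rounds: int = 3,
) -> Tuple[str, float]:
    """Sample candidates each round and keep the one the judge scores highest."""
    best, best_score = "", float("-inf")
    for _ in range(n_rounds):
        # All candidates in a round are conditioned on the best result so far.
        candidates = [generate(prompt, seed, best) for seed in range(n_candidates)]
        for candidate in candidates:
            score = judge(prompt, candidate)
            if score > best_score:
                best, best_score = candidate, score
    return best, best_score


def toy_generate(prompt: str, seed: int, previous: str) -> str:
    """Toy generator: extend the previous best with up to `seed` missing words."""
    kept = previous.split() if previous else []
    missing = [word for word in prompt.split() if word not in kept]
    return " ".join(kept + missing[:seed])


def toy_judge(prompt: str, candidate: str) -> float:
    """Toy judge: fraction of prompt words that appear in the candidate."""
    words = prompt.split()
    present = set(candidate.split())
    return sum(word in present for word in words) / len(words)


prompt = "two red cubes beside one blue ball"
best, best_score = refine_best_of_n(prompt, toy_generate, toy_judge)
```

Spending more rounds or more candidates per round trades extra inference compute for better prompt alignment, which is the sense in which such a loop scales at inference time.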

diffusion model, large language model, natural language, (15 more...)

arXiv.org Artificial Intelligence

2510.24133

Genre: Research Report (0.64)

Technology:

Information Technology > Artificial Intelligence > Vision (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)


47f30d67bce3e9824928267e9355420f-Paper-Conference.pdf

Neural Information Processing Systems, Oct-8-2025, 14:57:36 GMT

large language model, machine learning, natural language, (19 more...)

Neural Information Processing Systems

Country:

Europe > Switzerland > Zürich > Zürich (0.14)
North America > United States > Washington > King County > Seattle (0.04)
North America > United States > Michigan > Washtenaw County > Ann Arbor (0.04)
(4 more...)

Genre: Research Report (0.93)

Technology:

Information Technology > Artificial Intelligence > Vision (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.96)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.76)
Information Technology > Artificial Intelligence > Natural Language > Text Processing (0.68)


Implicit Inversion turns CLIP into a Decoder

D'Orazio, Antonio, Briglia, Maria Rosaria, Crisostomi, Donato, Loi, Dario, Rodolà, Emanuele, Masi, Iacopo

arXiv.org Artificial Intelligence, Jun-5-2025

CLIP is a discriminative model trained to align images and text in a shared embedding space. Due to its multimodal structure, it serves as the backbone of many generative pipelines, where a decoder is trained to map from the shared space back to images. In this work, we show that image synthesis is nevertheless possible using CLIP alone -- without any decoder, training, or fine-tuning. Our approach optimizes a frequency-aware implicit neural representation that encourages coarse-to-fine generation by stratifying frequencies across network layers. To stabilize this inverse mapping, we introduce adversarially robust initialization, a lightweight Orthogonal Procrustes projection to align local text and image embeddings, and a blending loss that anchors outputs to natural image statistics. Without altering CLIP's weights, this framework unlocks capabilities such as text-to-image generation, style transfer, and image reconstruction. These findings suggest that discriminative models may hold untapped generative potential, hidden in plain sight.
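For readers unfamiliar with the Orthogonal Procrustes projection mentioned above, the standard closed-form solution is short enough to show. This is a generic sketch of the textbook problem (find the orthogonal matrix R minimizing ||A R - B||_F via an SVD), not the paper's implementation; the random arrays are synthetic stand-ins for text and image embeddings, and the function name is illustrative.

```python
import numpy as np


def orthogonal_procrustes(source: np.ndarray, target: np.ndarray) -> np.ndarray:
    """Return the orthogonal matrix R minimizing ||source @ R - target||_F."""
    # Cross-covariance between the two sets of row vectors (d x d).
    cross = source.T @ target
    u, _, vt = np.linalg.svd(cross)
    # The optimal orthogonal map is the product of the singular-vector factors.
    return u @ vt


rng = np.random.default_rng(0)
text_emb = rng.normal(size=(50, 8))          # synthetic "text" embeddings
true_map, _ = np.linalg.qr(rng.normal(size=(8, 8)))  # random orthogonal matrix
image_emb = text_emb @ true_map              # synthetic "image" embeddings

rotation = orthogonal_procrustes(text_emb, image_emb)
aligned = text_emb @ rotation
```

Because the map is constrained to be orthogonal, it preserves norms and angles, so it can align two embedding sets without distorting the geometry within either one.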

artificial intelligence, clip 1, machine learning, (18 more...)

arXiv.org Artificial Intelligence

2505.23161

Genre: Research Report > New Finding (0.34)

Technology:

Information Technology > Sensing and Signal Processing > Image Processing (1.00)
Information Technology > Artificial Intelligence > Vision (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning > Generative AI (0.46)


Token Merging for Training-Free Semantic Binding in Text-to-Image Synthesis

Neural Information Processing Systems, May-27-2025, 21:41:23 GMT

Although text-to-image (T2I) models exhibit remarkable generation capabilities, they frequently fail to accurately bind semantically related objects or attributes in the input prompts; a challenge termed semantic binding. Previous approaches either involve intensive fine-tuning of the entire T2I model or require users or large language models to specify generation layouts, adding complexity. In this paper, we define semantic binding as the task of associating a given object with its attribute, termed attribute binding, or linking it to other related sub-objects, referred to as object binding. We introduce a novel method called Token Merging (ToMe), which enhances semantic binding by aggregating relevant tokens into a single composite token. This ensures that the object, its attributes and sub-objects all share the same cross-attention map.
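A toy example may help make "aggregating relevant tokens into a single composite token" concrete. The sketch below assumes the aggregation is an element-wise sum of the token embeddings, which the abstract does not specify; `merge_tokens` and the two-dimensional embeddings are hypothetical and only illustrate the bookkeeping.

```python
from typing import List, Sequence

Vector = List[float]


def merge_tokens(embeddings: Sequence[Vector], group: Sequence[int]) -> List[Vector]:
    """Replace the tokens at `group` indices with one composite token.

    Assumption: the composite is the element-wise sum of the grouped
    embeddings, placed at the position of the first grouped token.
    """
    if not group:
        return [list(vec) for vec in embeddings]
    members = sorted(set(group))
    dim = len(embeddings[0])
    composite = [sum(embeddings[i][d] for i in members) for d in range(dim)]
    anchor = members[0]
    merged: List[Vector] = []
    for index, vec in enumerate(embeddings):
        if index == anchor:
            merged.append(composite)
        elif index not in members:
            merged.append(list(vec))
    return merged


# Toy prompt "a red car and dog" with made-up 2-d embeddings per token.
tokens = ["a", "red", "car", "and", "dog"]
embeddings = [[0.0, 0.0], [1.0, 0.0], [0.0, 2.0], [0.5, 0.5], [3.0, 1.0]]

# Bind the attribute "red" to the object "car" by merging tokens 1 and 2.
merged = merge_tokens(embeddings, group=[1, 2])
```

Since "red" and "car" now occupy a single position in the text sequence, any cross-attention computed over that sequence necessarily assigns them one shared attention map.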

artificial intelligence, token merging, training-free semantic binding, (4 more...)

Neural Information Processing Systems

Technology: Information Technology > Artificial Intelligence > Vision (0.85)


